In [1]:
import pandas as pd # our core data analysis toolkit
import numpy as np
import pylab as pl # plotting libraries
from ggplot import *
from pandas import read_json # a function for reading data in JSON files
# This option shows our plots directly in IPython Notebooks
%matplotlib inline
# This option gives a more pleasing visual style to our plots
pd.set_option('display.mpl_style', 'default')
# The location of our playtest data file
filepath = "2014-05-13 makescape playtest.json"
In [2]:
def loadDataSortedByTimestamp(filepath):
    x = read_json(filepath)
    x = x.sort(columns='timestamp')
    x.index = range(0, len(x))
    return(x)
ms = loadDataSortedByTimestamp(filepath)
Now that our data is loaded into the variable ms (I chose it as an abbreviation of MakeScape), let's look at it and make sure it's sane. One of the first things I'll do is check the list of columns that our data comes with.
In [3]:
ms.columns
Out[3]:
Whoa! That is a lot of columns. 33 columns, to be exact. We can check that by calling Python's function for determining the length of a collection:
In [4]:
len(ms.columns)
Out[4]:
But we should also check how many rows (in this case, how many distinct events) we have in our dataset.
In [5]:
len(ms) # returns 8505
Out[5]:
In [6]:
columns = ['key', 'timestamp']
ms.head(n=5)[columns]
Out[6]:
What's less-than-helpful right now is that those timestamps are just raw integers. We want to make sure those integers actually represent times when data could reasonably have been collected (and not, say, January of the year 47532, which actually happened once).
Thankfully, pandas comes with a function that can convert Unix epoch time integers into human-recognizable dates. In this case, what we'll do is create a new column called human-readable-timestamp by applying the pandas Timestamp() function to our existing integers. Then we'll check the data.
In [7]:
ms['human-readable-timestamp'] = ms.timestamp.apply(lambda x: pd.Timestamp(x, unit='ms'))
columns = ['key', 'timestamp', 'human-readable-timestamp']
ms[columns].head()
Out[7]:
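As an aside, the same conversion can be done in a single vectorized call instead of an apply() (a sketch, equivalent to the cell above):

ms['human-readable-timestamp'] = pd.to_datetime(ms.timestamp, unit='ms')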
So, it seems like we have way too many MakeConnectComponent
events. Earlier, I explained that we're averaging more than one MakeConnectComponent
event per second. But what if we wanted to think about whether that average really describes a typical time slice of our data? In other words, we might want to know how our MakeConnectComponent
events are distributed over time.
One way to think about that distribution is to ask: when do our MakeConnectComponent
events occur over time? Below, I'm going to use the ggplot
package to look at the cumulative distribution of the connection events. The syntax may seem wonky and complicated at first, but it's actually an elegant implementation of ggplot2, itself an implementation of Leland Wilkinson's Grammar of Graphics. If at first you're stymied by it, don't worry. I'll try to help break it down for you.
First, we're doing some basic manipulation to get a cumulative sum column. This is actually so dumb I'm almost embarrassed: I create a column by applying a lambda function to the timestamp column that always returns 1, then I just sum cumulatively over that column. Lastly, I apply another lambda function to format our timestamp (currently an integer) into a nicer-formatted timestamp.
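(If the apply-a-constant-and-cumsum trick feels clunky, here's a sketch of an equivalent one-liner for the cell below, using just a running row count:)

import numpy as np

# Equivalent to applying a constant 1 and summing cumulatively
connectionEvents['cumulativeCount'] = np.arange(1, len(connectionEvents) + 1)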
I'll break these plotting lines down one major line at a time:

- First, I create the ggplot object, which is essentially the basic kind of object from whence all plots are constructed in ggplot. aes() just stands for aesthetic mapping, where I'm telling ggplot how to map data to graphical features. In this case, I'm saying "map the values in the timestamp1 column to the x position of this plot, and map the values of cumulativeCount to the y position." It may seem trivial now, but the power of aesthetic mappings like this is that I can also map quantities (or categories) to other graphical properties, for example mapping other columns in my dataframe to the aesthetic properties of color or shape. For now, we'll just stick with mapping quantities to cartesian x and y coordinates. When creating the ggplot object, I also tell it what dataframe I'm talking about, so the local names of timestamp1 and cumulativeCount make sense in scope.
- Next, I add a geom_line() layer to draw the cumulative counts as a line. I could have used geom_point() instead, which would give us a bivariate scatterplot instead of a lineplot.
- I use ggtitle(), xlab(), and ylab() to set the text for the plot title and axis labels.
- I make a print() call to make sure my plot shows up in my interactive session.
- Finally, I use ggsave() to save my plot directly to a file. ggsave() is smart and it detects the desired type of output file based on the suffix you pass in as a filename. In this case I'm using the .PNG format for images, but if I wanted an infinitely scalable image I could have used a .PDF extension.
In [8]:
# Manipulating the data to get a cumulative sum
# and nicely formatted timestamps
connectionEvents = ms[ms.key == 'MakeConnectComponent']
connectionEvents['cumulativeCount'] = connectionEvents.timestamp.apply(lambda x: 1).cumsum()
connectionEvents['timestamp1'] = connectionEvents.timestamp.apply(lambda x: pd.Timestamp(x, unit='ms'))
# Creating the basic plot
p = ggplot(aes(x='timestamp1',
y='cumulativeCount'),
data=connectionEvents)
p = p + geom_line()
p = p + ggtitle('Cumulative Distribution of MakeConnectComponent Events')
p = p + xlab('Time')
p = p + ylab('Event Count')
# Showing the plot
print(p)
# Saving the plot
ggsave(plot=p,
filename='cumulativeDistributionOfMakeConnectComponent1.png')
This plot has a number of interesting features. Easily the most salient feature is the giant flatline in the center. It looks like there were effectively no MakeConnectComponent events between 2000 hrs on the first day and 1400 hrs on the second day. And that seems entirely reasonable: the game likely would have been shut off during the night, then fired up again for testing the next day.
The challenge is that on either side of the flatline the curves are quite steep. That means there could be a fair amount of information hiding in the areas of the plot where there was activity, but it's up to us to extract out that long, boring middle portion where nothing is happening. How can we do that?
Well, suppose what we wanted was to create two new plots, call them Day 1 and Day 2, that have just the interesting parts: the parts where the slope of the cumulative distribution is nonzero. (Note that when the slope is nonzero, that's when action is happening and we're registering events over time.) If we wanted two separate plots, all we have to do is figure out when, exactly, that boring stretch of time starts and when, exactly, it ends. And one way to do that would be if we could somehow compute the time that elapses between each successive event in our dataset. So let's do that.
We're going to compute the time difference between successive events, which I'm calling time deltas. It's worth taking a second to think about how the delta is defined:
For the $i$th event, the delta $\Delta_i$ is given by the simple equation below, where $t_i$ is the timestamp of the $i$th event:
$\Delta_i = t_i - t_{i-1}$
As a result, the very first event $(i = 0)$ in a series will have a diff value of NaN
or NaT
(Not a Time), because the -1st event is undefined. But the second event will have a diff value of (Time of Second event - Time of First event). The very last event will also have a value: (Time of Last event - Time of Penultimate event).
The handy thing about pandas is that every data series (and a column counts as data series) has a diff()
method, which does exactly what we want: it computes successive pairwise differences between events. If we apply the .diff()
function to our timestamps, we'll have exactly what we want: a column of numbers where each number represents the time elapsed since the event that came before.
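To make that NaN/NaT behavior concrete, here's a toy example (a sketch with made-up timestamps rather than our playtest data):

import pandas as pd

toy = pd.Series(pd.to_datetime([1000, 1250, 1900], unit='ms'))  # three made-up events
print(toy.diff())
# The first delta is NaT (there's no previous event to subtract from the first one);
# the next two are 250 ms and 650 ms.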
In [9]:
connectionEvents['delta1'] = connectionEvents.timestamp1.diff()
And, now that we have our column of deltas, how can we figure out where the big boring part starts? Well, the big boring part lasts a long time: multiple hours. So, what we're looking for is a huge time delta. Say a delta of more than five hours.
In [10]:
connectionEvents[connectionEvents.delta1 > np.timedelta64(5, 'h')]['timestamp']
Out[10]:
In [11]:
whenBoringPartEnds = 1400077592798 # the timestamp of the first event after the >5-hour gap, taken from the output of In [10]
day1 = connectionEvents[connectionEvents.timestamp < whenBoringPartEnds]
day2 = connectionEvents[connectionEvents.timestamp >= whenBoringPartEnds]
# Creating the day1 plot
p = ggplot(aes(x='timestamp1',
y='cumulativeCount'),
data=day1)
p = p + geom_line()
p = p + ggtitle('Cumulative Distribution of MakeConnectComponent Events\nDay1')
p = p + xlab('Time')
p = p + ylab('Event Count')
# Showing the plot
print(p)
# Saving the plot
ggsave(plot=p,
filename='cumulativeDistributionDay1.png')
# Creating the day2 plot
p = ggplot(aes(x='timestamp1',
y='cumulativeCount'),
data=day2)
p = p + geom_line()
p = p + ggtitle('Cumulative Distribution of MakeConnectComponent Events\nDay2')
p = p + xlab('Time')
p = p + ylab('Event Count')
# Showing the plot
print(p)
# Saving the plot
ggsave(plot=p,
filename='cumulativeDistributionDay2.png')
Whenever successive events share the same timestamp, the delta between them is zero. So, if four successive events all share the same timestamp, they culminate in a fourth event whose delta3 is zero.
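Here's a toy illustration of that claim (a sketch with made-up timestamps):

import pandas as pd

# Two ordinary events followed by four events that share one timestamp (values in ms)
toy = pd.Series([1000, 1500, 2000, 2000, 2000, 2000])
deltas = pd.DataFrame({'t': toy,
                       'delta1': toy.diff(),
                       'delta2': toy.diff().diff(),
                       'delta3': toy.diff().diff().diff()})
print(deltas)
# Only the fourth of the shared timestamps has delta1, delta2, and delta3 all equal to zero.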
Let's do a little inspection and find events whose third and fourth order deltas are zero. We'll use the pandas method diff()
, which takes an array of data and computes the difference between contiguous pairwise elements.
Our code below computes successive deltas (up to fourth-order), and assigns each array of deltas to its own column in the dataframe.
In [12]:
ms['delta1'] = ms.timestamp.diff()
ms['delta2'] = ms.delta1.diff()
ms['delta3'] = ms.delta2.diff()
ms['delta4'] = ms.delta3.diff()
# A boolean expression to select events where deltas 1–3 are all zero
thirdOrderZeroes = (ms.delta3 == 0) & (ms.delta2 == 0) & (ms.delta1 == 0)
# The columns we'll want to view
columns = ['key', 'timestamp', 'delta1', 'delta2', 'delta3', 'delta4']
ms[thirdOrderZeroes][columns]
Out[12]:
In [13]:
ms[2470:2485][columns]
Out[13]:
FishCapture events are followed by DisconnectComponent events that almost all share the same timestamp. It looks like event 2482 is part of a sequence: a fish was captured at 2479, and (because when a fish gets captured it blows out a circuit) a number of components got disconnected in events 2480-2484. Note that not all the disconnect events share the same timestamp.
I mentioned in the last section that there's a huge problem with our data. Let's take a look again at the counts of different types of events, but this time we'll focus on the top five most frequent events:
In [14]:
topFiveMostFrequentEvents = ms.groupby('key').count().sort(columns=['timestamp'], ascending=False)[:5]
topFiveMostFrequentEvents['timestamp']
Out[14]:
In our implementation of events, MakeConnectComponent
should be triggered once the system detects that two circuit elements have become connected (say, when a player bumps a resistor and a battery together). MakeDisconnectComponent
should be triggered once the system detects that two connected elements have become disconnected (say, when a player swipes a finger across a wire to cut it.)
That leads to Problem 1: we have far more MakeConnectComponent events than MakeDisconnectComponent events. And not just more, way more. Almost three times more. If you think about it, this doesn't make sense at all.
In our game, each block has a lone positive terminal and a lone negative terminal. And, each terminal only accepts a maximum of one connection, for simplicity. When players add blocks to the table, the blocks start out not connected to anything. The first MakeConnectComponent
event should happen when two free terminals from two different blocks get bumped together. And, if two terminals are connected, they can't get connected to other things without getting disconnected first. So, we should expect at least something close to parity: there should be about as many disconnect events as there are connect events. Otherwise, how can a bunch of connected terminals keep connecting to other things? (Again, remember that each terminal should accept a maximum of one connection.)
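A quick way to check that parity argument against the data (a sketch, assuming ms is loaded as above):

counts = ms.groupby('key').size()
connects = counts.get('MakeConnectComponent', 0)
disconnects = counts.get('MakeDisconnectComponent', 0)
# If every terminal accepts at most one connection, these should be roughly equal
print('connects: %d, disconnects: %d, ratio: %.2f'
      % (connects, disconnects, float(connects) / max(disconnects, 1)))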
So, that's pretty weird. And also possibly very bad. But, it gets even worse when you consider Problem 2.
In our game, we built an event that takes a snapshot of the entire state of the board at regular intervals. We did that because we knew not all actions in the game should generate events. For example, if we recorded every single time any block changed its position, that would be way too much data. On the other hand, we need to know when players move blocks even if those movements don't generate big game events (like completing a circuit). So, we compromised. Every second, the game stores a snapshot of information about the state of the board, and that event is called MakeSnapshot
.
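One quick way to sanity-check that once-per-second behavior (a sketch, assuming ms is loaded as above) is to look at the gaps between successive MakeSnapshot timestamps; apart from breaks between play sessions, they should sit near 1000 ms:

# Gaps (in ms) between successive MakeSnapshot events
snapshots = ms[ms.key == 'MakeSnapshot']
print(snapshots.timestamp.diff().describe())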
If you look at the table above (or the bar chart in our previous section), you'll notice that MakeSnapshot
comes in second in our Top 5 Most Frequent Events. And, it's not a small margin, either. MakeConnectComponent
is beating it by 25%. That is a Very Weird Thing.
If we assume that the snapshots are reliably firing every second, that means that, on average, the system is registering block-to-block connections more than once per second. If all that data were user-generated, that would mean that over a period of about 44 total minutes of gameplay, kids were connecting blocks at a rate of more than one block per second, every second, for the entirety of the 44 minutes.
To put that in perspective one more way:
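Here's one rough way to do that (a sketch; it assumes ms is loaded as above and treats the number of MakeSnapshot events as an approximation of the seconds of gameplay):

counts = ms.groupby('key').size()
secondsOfPlay = counts['MakeSnapshot']
connectsPerSecond = counts['MakeConnectComponent'] / float(secondsOfPlay)
print('%.1f minutes of play, %.2f connection events per second'
      % (secondsOfPlay / 60.0, connectsPerSecond))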
Rather than assuming children are connecting circuit elements at a ludicrously high rate (Problem 2), which also would seem physically impossible (Problem 1), it seems more likely that there's a problem with the game's data logging that's revealed by our data. So, let's go to the next section and explore a method for checking it out.
In [15]:
topFive = loadDataSortedByTimestamp(filepath)
topFiveMostFrequentEvents = list(ms.groupby('key').count().sort(columns=['timestamp']).index)[-5:]
frequencyFilter = topFive.key.apply(lambda x: x in topFiveMostFrequentEvents)
topFive = topFive[frequencyFilter]
topFive['delta1'] = topFive.timestamp.diff()
binBreaks = [-1, 1, 50, 100, 200, 300, 500, 1000]
# binBreaks = [1000, 2000, 3000, 4000, 5000]
p = ggplot(aes(x='delta1',
fill='key'),
data=topFive) + \
geom_histogram(breaks=binBreaks) + \
scale_x_continuous(breaks=binBreaks) + \
ggtitle('Distribution of Time Deltas Between Successive Events') + \
ylab('Number of Events') + \
xlab('Time Between Events (ms)')
# print(p)
# ggsave(p, "histogram.png")
# topFive.head(n = 20)[['timestamp', 'delta1', 'key']]
print(p)
So, we know the distribution of MakeConnectComponent
events is uneven. We know that because in the last section we plotted the cumulative distribution function and saw it had a widely variable slope. What I'd like to do now is get an idea of just what that distribution of elapsed event times looks like.
To do that, we're going to use a visualization called a kernel density estimate. Essentially, what we're doing is creating a smoothed empirical approximation of what the distributions of events look like.
We're also going to use another powerful feature of graphical analysis: what statistician [Bill Cleveland](http://cm.bell-labs.com/cm/ms/departments/sia/wsc/) and Edward Tufte call "small multiples." We're actually going to take a look at the distributions of time deltas for the top five most frequent kinds of events and graphically compare them.
In [16]:
topFive = loadDataSortedByTimestamp(filepath)
topFiveMostFrequentEvents = list(ms.groupby('key').count().sort(columns=['timestamp']).index)[-5:]
frequencyFilter = topFive.key.apply(lambda x: x in topFiveMostFrequentEvents)
topFive['delta1'] = topFive.timestamp.diff()
topFive = topFive[frequencyFilter]
p = ggplot(aes(x = 'delta1',
group='key'),
data=topFive)
p = p + geom_density() # a Kernel Density Estimate
p = p + scale_x_continuous(limits=[-1000, 20000])
p = p + facet_wrap(y='key',
ncol=1,
scales='fixed')
p = p + xlab('Time Between Successive Events (ms)')
p = p + ggtitle('Smoothed Kernel Density Estimates')
print(p)
ggsave(plot=p,
filename='kernelDensityEstimate.png')
In [17]:
connections = loadDataSortedByTimestamp(filepath)
connections = connections[connections.key == 'MakeConnectComponent']
connections['delta1'] = connections.timestamp.diff()
binBreaks = [0, 50, 100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
p = ggplot(aes(x='delta1',
fill='key'),
data=connections) + \
geom_histogram(breaks=binBreaks) + \
scale_x_continuous(breaks=binBreaks) + \
ggtitle('Distribution of Time Deltas Between Successive Events') + \
ylab('Number of Events') + \
xlab('Time Between Events (ms)')
print(p)
ggsave(plot=p,
filename='histogram2.png')
In [18]:
topFive[topFive.key == 'MakeDisconnectComponent'][['key', 'delta1']].head()
print(topFive[topFive.key == 'MakeDisconnectComponent']['delta1'].describe())
topFive[topFive.key == 'MakeDisconnectComponent']['delta1'].plot(kind='kde')
# print(p)
# ggsave(p, "histogram.png")
# topFive.head(n = 20)[['timestamp', 'delta1', 'key']]
Out[18]:
In [19]:
connects = loadDataSortedByTimestamp(filepath)
connects = connects[connects.key == 'MakeConnectComponent']
connects['delta1'] = connects.timestamp.diff()
p = ggplot(aes(x='delta1',
fill='key'),
data=connects) + \
geom_histogram(breaks=binBreaks) + \
scale_x_continuous(breaks=binBreaks) + \
ggtitle('Distribution of Time Deltas Between Successive Events') + \
ylab('Number of Events') + \
xlab('Time Between Events (ms)')
print(p)
In [20]:
len(connections[connections.delta1 <= 1000]) # 1744 events
Out[20]:
In [21]:
columns = ['timestamp', 'key']
ms.groupby('key').count().sort(columns=['timestamp'], ascending=False)[columns]
Out[21]:
In [22]:
# We can also see what this looks like as a plot
msdata = ms.groupby('key').count().sort(columns=['timestamp', 'key'], ascending=False)
pl.figure(figsize=(8, 11)) # set the figure size up front; savefig() has no figsize argument
p = msdata['timestamp'].plot(kind='bar')
print(p)
pl.savefig("barChart.jpg",
dpi=300,
bbox_inches='tight')
From our meeting on 2014-05-30, Allison confirmed that MakeConnectComponent always generates a component_list of the two blocks being connected.
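We can spot-check that claim directly (a sketch, assuming ms is loaded as above) by looking at the lengths of the component_list values attached to connect events, skipping any nulls:

connectLists = ms[ms.key == 'MakeConnectComponent']['component_list']
print(connectLists.dropna().apply(len).value_counts())  # how many lists of each length
print(pd.isnull(connectLists).sum())                    # connect events with no list at all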
Matthew also said that every time a connection is established between blocks A and B, it generates two MakeConnectComponent events.
So, I tried investigating whether all MakeDisconnectComponent
events generate a component_list
with just two components.
But, my first query below failed, because it seems some events have null values for component_list
:
(ms[ms.key == 'MakeDisconnectComponent']['component_list']).apply(lambda x: len(x))
So, let's try another tactic. First, we'll create a boolean filter that flags events whose component_list is null.
In [23]:
nullComponentLists = pd.isnull(ms['component_list'])
ms[nullComponentLists & (ms.key == 'MakeDisconnectComponent')][['key', 'component_list']]
Out[23]:
Event 980 is the first disconnect event where there is no component_list, so let's check out what's happening in the neighborhood of that event.
In [24]:
ms[970:986][['key', 'timestamp']]
Out[24]:
So, it looks like events 980, 981, and 982 all share the exact same timestamp (down to the millisecond), with one final disconnection event happening only about 15ms later.
In [25]:
ms[980:984][['key', 'timestamp', 'delta1']]
Out[25]:
We can try to interrogate what's happening by looking at the component lists. First, let's look inside the circuit that was created at event 975. I'm calling list()
on it because I kind of don't understand how else to get pandas to get me the output format I need to inspect :-)
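(For what it's worth, a sketch of an alternative: Series.tolist() returns the same plain Python list, e.g. ms[975:978].component_list.tolist().)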
In [26]:
list(ms[975:978].component_list)
Out[26]:
So, the circuit is a 3-block circuit. After a fish hits that circuit, it fries all the virtual wires (because the fish are bioelectric), so we should expect to see 3 disconnect events:
In [27]:
list(ms[980:984].component_list)
Out[27]:
In [28]:
ms[980:984].timestamp
Out[28]: